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Pairs of RNA molecules transcribed from partially or entirely complementary loci are called c/s-natural antisense tran- 
scripts (ds-NATsj, and they play key roles in the regulation of gene expression in many organisms. A promising exper- 
imental tool for profiling sense and antisense transcription is strand-specific RNA sequencing (ssRNA-seq). To identify cis- 
NATs using ssRNA-seq, we developed a new computational method based on a model comparison framework that in- 
corporates the inherent variable efficiency of generating perfectly strand-specific libraries. Applying the method to new 
ssRNA-seq data from whole-root and cell-type-specific Arabidopsis libraries confirmed most of the known ds-NAT pairs 
and identified 918 additional ds-NAT pairs. Newly identified ds-NAT pairs are supported by polyadenylation data, al- 
ternative splicing patterns, and RT-PCR validation. We found 209 ds-NAT pairs that have opposite expression levels in 
neighboring cell types, implying cell-type-specific roles for ds-NATs. By integrating a genome-wide epigenetic profile of 
Arabidopsis, we identified a unique chromatin signature of ds-NATs, suggesting a connection between ds-NAT tran- 
scription and chromatin modification in plants. An analysis of small-RNA sequencing data showed that -4% of ds-NAT 
pairs produce putative ds-NAT-induced siRNAs. Taken together, our data and analyses illustrate the potential for 
multifaceted regulatory roles of plant ds-NATs. 



[Supplemental material is available for this article.] 

Regulatory RNAs, such as microRNAs, small interfering RNAs 
(siRNAs), and long intergenic noncoding RNAs (lincRNAs), play 
fundamental roles in the control of gene expression in many 
organisms (Pasquinelli 2012; Rinn and Chang 2012). One par- 
ticular type of RNA-mediated gene expression regulation involves 
ris-NATs (natural antisense transcripts), where pairs of antisense 
transcripts are generated from the same genomic locus. ris-NAT 
pairs are a widely occurring phenomenon in eukaryotic organ- 
isms; —20% of human genes and —9% of Arabidopsis genes have 
been predicted to form ris-NAT pairs (Wang et al. 2005; Sun et al. 
2006). The most common type of ris-NAT pairs is formed by the 
overlapping 3' ends of a pair of coding or noncoding RNA tran- 
scripts (Wang et al. 2005). Overlapping 5' ends of a ris-NAT pair or 
complete inclusion of one antisense transcript in another sense 
transcript have also been found (Lapidot and Pilpel 2006). 

ris-NATs regulate gene expression by a variety of mecha- 
nisms such as causing RNA polymerase II collision (Prescott and 
Proudfoot 2002), mediating chromatin modification (Pandey 
et al. 2008; Conley and Jordan 2012; Modarresi et al. 2012; Luo 
et al. 2013; Zhan and Lukens 2013), inducing siRNA (nat-siRNA) 
formation (Borsani et al. 2005; Zhang et al. 2012), and controlling 
translation (Carrieri et al. 2012). To date, nearly 100 pairs of 
functional natural antisense transcripts have been characterized 
in mammalian systems and in plants (Faghihi and Wahlestedt 
2009). We are now seeing an acceleration in the discovery of 
functional NATs involved in human diseases, such as cancer 
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(Morris et al. 2008; Yu et al. 2008) and neurodegenerative disorders 
(Faghihi et al. 2008; Carrieri et al. 2012), metabolic disorders (Li 
et al. 2012), as well as in plant stress responses (Borsani et al. 2005) 
and plant development (Ron et al. 2010). 

Although sequencing of cDNA/EST libraries (Jen et al. 2005; 
Wang et al. 2005) and microarray experiments (Ge et al. 2008; 
Morrissy et al. 2011) have provided preliminary genome-wide as- 
sessments of ris-NATs (Henz et al. 2007; Jin et al. 2008; Chen et al. 
2012), results from these early studies are less than ideal (Weber 
et al. 2007). In recent years, the advent of high-throughput se- 
quencing has provided the opportunity to systematically charac- 
terize the transcriptome at higher coverage and better accuracy 
than conventional approaches (Mortazavi et al. 2008). In partic- 
ular, a number of strand-specific RNA-seq (ssRNA-seq) protocols 
have been designed to characterize antisense transcripts (Levin 
et al. 2010; Passalacqua et al. 2012). Because the length of RNA-seq 
reads is still shorter than the length of coding or noncoding RNAs 
in the ris-NAT pairs, true ris-NAT pairs can only be identified by 
computational tools and statistical inference methods that in- 
tegrate the information from many short reads. For example, 
a study in yeast found over 1000 "antisense units," defined as re- 
gions where antisense reads cover 25% of the annotated sense gene 
regions (Yassour et al. 2010). In other studies, hundreds of anti- 
sense transcripts were identified by simply using minimal numbers 
of antisense reads, such as requiring the presence of at least two 
antisense reads in mouse brain libraries (Parkhomchuk et al. 2009) 
or three antisense reads in mouse intestine libraries (Klostermeier 
et al. 2011). The accuracy of these computational strategies using 
simple thresholds can be affected by the biological variation in the 
RNA-seq data (Glaus et al. 2012), library size differences (Bullard 
et al. 2010), and the efficiency of producing perfectly strand-specific 
libraries (Levin et al. 2010). 
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Statistical analysis of cu-natural antisense genes 



The power of statistical modeling to analyze read count data 
has been demonstrated for many applications, including identifi- 
cation of differentially expressed genes (Anders and Huber 2010; 
Robinson et al. 2010) or quantification of alternative splice iso- 
forms (Trapnell et al. 2010; Li and Dewey 2011). However, most 
existing statistical methods for RNA-seq quantification assume 
that the RNA-seq protocol is either perfectly nonstrand-specific 
or perfectly strand-specific. In the perfectly strand-specific case, a 
sense strand gene is assumed to generate reads that only map to the 
sense strand; the reads that map to the antisense stand are dis- 
carded (Trapnell et al. 2010). However, a recent comprehensive 
study found that ssRNA-seq protocols have variable error rates 
(Levin et al. 2010), defined as the fraction of antisense reads gen- 
erated by the sense gene. Depending on the protocol, from 1% to 
12% of all reads that result from the expression of a sense strand 
gene are mapped to the reverse strand of that gene. This variable 
error rate of strand-specific protocols is the result of complex bio- 
chemical reactions that occur during library preparation and thus 
may never be completely eliminated. Moreover, the error rate 
of the same protocol could change from one library to another, 
causing additional variability between libraries and lowering the 
accuracy of expression quantification. To accurately identify cis- 
NAT pairs using ssRNA-seq data and to further characterize the 
mechanism by which ris-NAT pairs regulate gene expression, we 
introduce a probabilistic method called natural antisense tran- 
scripts identification using RNA-seq (NASTI-seq). NASTI-seq in- 
corporates the variable error rate of a strand-specific protocol in 
a read count model, and therefore outperforms other computa- 
tional approaches in identifying ris-NAT pairs. 

To systematically characterize ris-NATs in Arabidopsis, we 
generated strand-specific RNA libraries from the Arabidopsis root, 
which is an excellent model system for studying spatial and tem- 
poral gene expression patterns (Brady et al. 2007; Petricka et al. 
2012). Since little is known about the cell-type-specific expression 
of ris-NATs in any plant species, we chose to look for ris-NATs in 
the endodermis (ENDO) and cortex (CORT), two cell types that are 
derived from the same stem cell precursor and are the product of an 
asymmetric cell division. By using NASTI-seq and over 120 million 
mapped reads, we not only confirmed approximately 1500 anno- 
tated ris-NAT pairs but also identified an additional 918 candidate 
c/s-NAT pairs, which increases the total known ris-NAT pairs in 
Arabidopsis by >60%. We also identified more than 200 c/s-NAT 
pairs that show opposite expression patterns in the endodermis 
and cortex, suggesting potential cell-type-specific roles of ris- 
NATs. From previously acquired cell-type-specific small RNA se- 
quencing data (Breakfield et al. 2012), we identified dozens of loci 
of cis-NAT-induced siRNAs (nat-siRNA). By analyzing the chro- 
matin marks of the c/s-NAT pairs (Roudier et al. 2011), we found 
a signature in the density of activating chromatin marks in the 
promoter region of the sense gene, suggesting a connection be- 
tween antisense transcription and chromatin modification. Our 
statistical model is very flexible, can be applied to any strand- 
specific protocol to identify c/s-NAT pairs in other biological sys- 
tems, and provides a general framework to incorporate protocol 
error rate (PE) in the analysis of deep sequencing data sets. 

Results 

Strand-specific RNA-seq of Arabidopsis roots 

We generated replicate ssRNA-seq data from total RNA isolated 
from Arabidopsis whole roots (sample names: unWRl, unWR2, 



unWR3; generated from unsorted cells), mock-sorted whole roots 
(WR1, WR2, WR3), as well as cell populations enriched for two 
individual cell types: ENDO (ENDOl, END02, END03; generated 
from sorted cells) and CORT (CORT1, CORT2, CORT3; generated 
from sorted cells). All 12 libraries were made using a modified 
SOLiD Total RNA-Seq (Applied Biosystems) protocol. A linear and 
strand-specific amplification step was used for all 12 libraries ex- 
cept for the unWRl sample (see Methods; Supplemental Fig. 1; for 
details of all libraries, see Supplemental Table 1). Approximately 
120 million 50-bp reads (on average 10 million reads per library) 
were uniquely mapped to the Arabidopsis genome and transcrip- 
tome (Methods; Supplemental Table 2). We found our experimental 
method faithfully retains relative gene expression levels at the ge- 
nome scale and provides high reproducibility of the gene ex- 
pression levels compared with previous data using microarrays 
(Supplemental Methods). 

We calculated the strand-specific protocol error rate (PE), 
which is defined as the fraction of reads mapping to the un- 
expected strand (Levin et al. 2010) of each annotated Arabidopsis 
gene. We found that our strand-specific protocol is highly specific 
(average PE for each library ranges from 2.4% to 3.5%) (Supple- 
mental Table 1). As expected, the protocol is not perfectly strand- 
specific, and the PE varies across different libraries. 

A statistical model to discover natural antisense transcripts 

One simple way to identify antisense transcripts is to count the 
number of reads that map to the unexpected strand (denoted N 0 ) 
(Fig. 1A). Due to the imperfect efficiency of protocols, ssRNA-seq 
usually produces a small number of reads from the unexpected 
strand. In most cases (Fig. 1A, e.g., gene 1), one can expect that N 0 
is small, which is due to the fact that most genes in the genome do 
not show significant antisense transcription. A high number of 
reads from the unexpected strand (i.e., a high N 0 ) (Fig. 1A, gene 3) 
suggests potential antisense transcription. In the case of a c/s-NAT 
pair, some of the unexpected reads could come from a nearby gene. 
For instance, a fraction of the unexpected reads found in gene 3 
can be explained as coming from the extended 3' UTR of gene 2 
(Fig. 1A). 

To account for the efficiency of strand-specific protocols, our 
method (NASTI-seq) models the probability of seeing unexpected 
reads using a binomial distribution and identifies ris-NATs under 
a model comparison framework. Our method takes two consecutive 
steps. First, we calculate a score (NASTI score) based on Bayesian 
information criterion (BIC), which represents the evidence of anti- 
sense transcription for each locus in the genome. In the second step, 
we identify candidate c/s-NAT pairs that have small intergenic 
distances and are on opposite strands in the genome. The NASTI 
scores of the two genes in each candidate pair are combined into 
one score, which is used to identify c/s-NAT pairs. The threshold 
applied to the combined score is determined using training data. 
A detailed description of the statistical method is provided in 
Supplemental Methods. 

To evaluate the performance of NASTI-seq, we performed 10- 
fold cross-validation. c/s-NAT pairs supported by existing annota- 
tions (Wang et al. 2005) were used as a positive training set and 
gene pairs that are unlikely to form natural antisense pairs were 
used as a negative training set (Methods; Supplemental Table 3). 
We first tested NASTI-seq on the whole-root sample (WR1-3), and 
we found that the performance was remarkably high, with an av- 
erage area under the receiver operating characteristic (ROC) curve 
(auROC) of 0.974 (Fig. 1B,C). We then compared NASTI-seq with 
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Figure 1. Statistical framework for the identification of c/s-NAT gene 
pairs. (A) Illustration showing read distribution for a strand-specific pro- 
tocol. Reads that are mapped to the forward and reverse strands of the 
genome are colored in blue and red, respectively. (Blue arrows) Gene 
models (including UTRs). Summary statistics used in the comparisons are 
shown in the table. (N 0 ) Number of reads from the unexpected strand. For 
gene 1 and gene 2, N 0 equals the number of red reads that mapped within 
the annotated gene boundaries. For gene 3, N 0 equals the number of blue 
reads that mapped within the annotated gene boundaries. (N q ) Number 
of reads from the expected strand. (PE) Strand-specific protocol error rate. 
(Neg Count) Number of reads from the unexpected strand. (Neg RPKM) 
Read per kilobase per million reads (RPKM) for the reads from the un- 
expected strand. To calculate RPKM, all genes are assumed to be 2000 
base pairs in length, and the library size is assumed to be one million reads. 
(Count Ratio) The ratio of N 0 and N q . (Logistic Regression) Log probability 
was used to rank genes. (NASTI [BIC score]) The BIC score was used to rank 
genes. (B) Receiver operating characteristic (ROC) curves for the com- 
parison of different classification methods. (FPR) False positive rate. (TRP) 
True positive rate. (Inset) The same curves with FPR between 0 and 0.2. 
(C) Area under the ROC curve (auROC) comparison of different methods. 
For both logistic regression and NASTI-seq, standard deviations of the 
auROC from 1 0-fold cross-validation are shown. (WR) Whole-root sample. 
(ENDO) Endodermis sample. (CORT) Cortex sample. Cross-validation is 
not available for simple methods (Neg Count, Neg RPKM, Count Ratio). 



several recently published methods using read counts or ratios 
(negCount or Count Ratio) (Fig. 1A). The negCount method is 
equivalent to the decision rules based on read count (Klostermeier 
et al. 2011); Count Ratio (Ni et al. 2010) and an equivalent 
method of Count Ratio (Passalacqua et al. 2012) have been used 
to identify antisense transcription. Both negCount and Count 
Ratio used in our comparison are more flexible compared with 
their original description because the thresholds are determined 
by the training data. Additionally, reads per kilobase per million 
of unexpected reads (negRPKM) were also calculated and were 
used to decide c/s-NAT pairs. To compare NASTI-seq to a more 
elaborate statistical model, we trained a logistic regression model 
using the expected read counts and unexpected read counts as 
features to predict c/s-NAT pairs (Supplemental Methods). The 
logistic regression performs better than the count- or ratio-based 
methods, but did not perform as well as NASTI-seq (Fig. 1B,C). 

Our training sets were derived from analysis of an EST data- 
base, which is enriched with ubiquitously and highly expressed 
genes. We would expect that cell-type-specific c/s-NATs are less 
likely to be represented in the training set. However, we found 
that the auROC value of the WR library is indistinguishable from 
that of the ENDO and CORT libraries, suggesting that the power 
of our method is not affected by the type of biological sample 
used (Fig. 1C). The performance of the statistical methods for 
RNA-seq data may be affected by the sequencing depth (Tarazona 
et al. 2011). To evaluate this effect, we performed subsampling 
experiments (Ramskold et al. 2009). We found that the number of 
NAT pairs identified by our approach starts to plateau at 15 mil- 
lion reads (three replicates with 5 million reads in each replicate) 
(Supplemental Fig. 2). Overall, these results show that NASTI-seq 
is very accurate and performs consistently better than the other 
methods tested. 

NASTI-seq identifies hundreds of new ck-NAT pairs 

For each biological sample, we selected a threshold in the com- 
bined BIC score such that the false-discovery rate (FDR) was below 
0.05, and identified c/s-NAT pairs across all genes in the genome. 
In all three samples, we were able to validate approximately 600 
c/s-NAT pairs, accounting for —80% of gene pairs from the posi- 
tive training sets (Fig. 2A, positive training). In each sample, we 
found more than 600 additional gene pairs supported by the 
current Arabidopsis genome annotation (Fig. 2A, TAIR10), dem- 
onstrating the effectiveness of the NASTI-seq algorithm. More 
importantly, in each sample we identified over 500 c/s-NAT pairs 
involving known genes that are not currently annotated as over- 
lapping (Fig. 2A, novel). Finally, we also identified over 400 "or- 
phan" genes in each sample, i.e., individual known genes that do 
not have a candidate partner gene annotated within 500 bp. These 
"orphan genes" may be due to the misalignment of ESTs during 
genome annotation or due to incomplete annotations of the loci 
(see Discussion; Supplemental Fig. 3A). 

Next, we analyzed the tissue distribution of c/s-NAT pairs 
identified by the NASTI-seq algorithm. The c/s-NAT pairs from the 
training set and the pairs that are supported by TAIR10 were 
combined as "known c/s-NAT pairs" (Fig. 2B). Other c/s-NAT pairs 
identified by NASTI-seq were denoted as "novel c/s-NAT pairs" 
(Fig. 2C). We found 1490 known c/s-NAT pairs across all samples, 
with >73% (1085) of these pairs identified in each of the three 
samples. This large number of known cis-NATs that are common 
among all samples is not surprising as the existing annotation is 
largely based on cDNA and EST evidence derived primarily from 
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Figure 2. Types of newly identified c/s-NAT pairs and their tissue dis- 
tribution. (A) Numbers of c/s-NAT gene pairs from the training set, TAIR1 0 
annotation, and novel predictions are plotted and color-coded according 
to sample types. c/s-NAT gene pairs from the training set and from TAIR1 0 
annotation are combined as known c/s-NAT pairs. (8) Distribution of known 
c/s-NAT pairs in different samples. (C) Distribution of novel c/s-NAT pairs in 
different samples. 



whole roots or whole seedlings. Among the known c/s-NAT pairs 
(Fig. 2B), 85.1% of the pairs are formed by overlapping 3' ends of 
two transcripts, whereas 6.5% of pairs are formed by overlapping 
5' ends. There are also 8.4% of the pairs formed by complete in- 
clusion of one transcript in another transcript. Of the 918 novel 
c/s-NAT pairs identified across all three samples, only 31% (288 
pairs) were found in all three samples, suggesting potential cell- 
type-specific regulation of c/s-NAT pairs in Arabidopsis (Fig. 2C). 
Among the novel cis-NAT pairs, 91.1% are formed by overlapping 
3' ends, and 8.9% of the pairs are formed by overlapping 5' ends. 
We did not find any novel transcript pair in which one transcript is 
located within the other transcript. However, the read distribution 
in some of the orphan antisense genes suggests that some of these 
loci annotated with a single gene may in fact contain a pair of 
antisense transcripts of which one transcript is completely em- 
bedded within another transcript. 

Genomic evidence supports the set of novel c/s-NAT pairs 

Because the majority of known natural antisense pairs was found 
to have overlapping 3' ends (Jen et al. 2005), it is possible that one 
gene in a newly identified cis-NAT pair is utilizing an alternative 
polyadenylation site located in the transcribed region of the other 
gene. To test this hypothesis, we analyzed recently published 
polyA-seq data from Arabidopsis seeds and leaves (Wu et al. 2011). 
For each novel cis-NAT pair (Fig. 2C) that has potentially over- 
lapping 3 ' ends, we examined whether a polyA cluster (PAC) on the 
same strand of one gene is actually located in the annotated region 
of the other gene, thus supporting an overlapping gene pair. We 



found that, in each sample type, >80% of the novel cis-NAT pairs 
are supported by one or more PACs (Table 1A). We contrasted 
these findings with distribution of PACs in known cis-NAT pairs 
and a negative control set of gene pairs that are not predicted as 
cis-NATs by the NASTI-seq algorithm. The negative control set (see 
Supplemental Table 3) was required to have the same genomic 
features as the novel cis-NAT pairs; i.e., the intergenic region be- 
tween genes in each pair was <500 bp, and genes in each pair were 
located on opposite strands. Less than 60% of the negative control 
pairs were found to harbor PACs in the antisense genes, while close 
to 90% of known cis-NAT pairs were supported by PACs (Table 1 A), . 
For gene pairs that contain antisense PACs, we also analyzed the 
expression levels of the PACs as measured by reads per million 
(FcPM) (Supplemental Fig. 4). In all samples, we found that the 
average RPM of antisense PACs for the novel cis-NAT pairs and the 
known cis-NAT pairs were both higher than that of the negative 
controls (P < 0.01, Wilcoxon test), suggesting a quantitative dif- 
ference in the PAC signals. 

cis-NAT pairs have been implicated in the process of mRNA 
splicing in both plants and mammals (Chen et al. 2005; Jen et al. 
2005; Morrissy et al. 2011). In Arabidopsis, genes with multiple 
exons are more likely to be found in cis-NAT pairs than randomly 
selected gene pairs. We asked whether the same trend is found in 
the newly identified cis-NAT pairs (Table IB). More multi-exon 
genes were found in the known cis-NAT pairs than in the control 
pairs, confirming previous observations. We also found that the 
novel cis-NAT pairs contain more multi-exon genes than the control 
pairs (P < 1 x 10~ 5 , for each of the three samples, \ 2 test). In 
humans, NATs tend to have slightly shorter introns. For each pair 
of NATs, we designated the gene that has higher expression (as 
measured by RPKM) as the major gene, and the other gene as the 
minor gene. We calculated the intron length distribution of major 
and minor genes in all the cis-NAT pairs identified in our analysis 
(see Methods) and found a small but significant trend for minor 
genes to have shorter introns compared with control genes. 

To experimentally validate novel cis-NATs, we carried out RT- 
PCR on five pairs of novel natural antisense genes (Fig. 3; Supple- 
mental Fig. 5). For instance, AT2G46020 and AT2G46030 are 
predicted as cis-NAT pairs by NASTI-seq. The two genes are en- 
coded on the opposite strands of genomic DNA but are not anno- 
tated as overlapping in their 3' UTR (see gene models in Fig. 3B). 
For each whole-root sample (WR1, WR2 and WR3), we plotted the 
cumulative frequency of reads from the forward strand (Fig. 3A,B, 
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Figure 3. Experimental validation of novel predicted c/s-NAT pairs. 
(A) Cumulative distribution of reads adjacent to the overlapping region of 
AT2C46020 and AT2G46030. (x-axis) Genomic location of chromosome 2 
of Arabidopsis. Reads from different biological replicates are in different 
colors. (Solid lines) Cumulative percentage of reads that are mapped to 
the reverse strand. (Dashed lines) One minus cumulative percentage of 
reads that are mapped to forward strand. (6) Same as A, except that the 
cumulative distribution curves <3% are shown. Gene models are shown 
below the distribution curves. Arrow heads indicate the direction of the 
genes. Thick lines represent the exons; thin lines between thicker lines are 
introns. The locations of primer pairs (LI /Rl and L2/R2) are shown as black 
arrows. (C) Reverse transcription (RT)-PCR validation of the predicted 
overlapping genes. No RT reaction was used as negative control. Genomic 
DNA (gDNA) was used as positive control. 



other replicates as predictors. We found in two of three biological 
samples, R 2 improved after adjustment for antisense reads, sug- 
gesting that our approach improves the accuracy of the expres- 
sion estimation for c/s-NAT genes (Fig. 4). 

To further characterize the expression pattern of natural an- 
tisense genes, we compared the expression levels of c/s-NAT pairs 
to genes that are not predicted to form c/s-NAT pairs. For each pair 
of genes, we designated a major gene and a minor gene based on 
their expression levels. We calculated the fraction of the minor 
gene expression (9) as the ratio of the minor gene expression level 
to the sum of the expression levels of both major and minor genes 
(Methods). More genes in the known and novel c/s-NAT pairs have 
smaller 9 compared with the gene pairs in the complete negative 
training set and with the gene pairs in the negative control (Fig. 5), 
and results were qualitatively similar for the two cell-type-specific 
samples (Supplemental Fig. 6). We also found that 42% of gene 
pairs in the negative training set and 18% of gene pairs in the 
negative control set have 9 ~ 0.5 (Fig. 5), which suggests that both 
genes in a pair are expressed at similar levels. Overall, the similarity 
of the cumulative frequency of "9 between known c/s-NAT pairs 
and predicted c/s-NAT pairs supports the hypothesis that the novel 
c/s-NAT pairs are true c/s-NAT pairs. 

Because ENDO and CORT are two cell types derived from the 
same stem cell precursor, these two samples provide an ideal model 
to study cell-type-specific expression of c/s-NAT pairs. We searched 
for c/s-NAT pairs that display opposite expression levels in these 
two cell types; i.e., the minor gene expression fraction changes from 
above 0.5 in one cell type to below 0.5 in another cell type. For 
instance, AT1G072S0, a UDP-glucosyl transferase, and AT1G07260, 
a GTP binding protein, are predicted to form c/s-NAT pairs. The 
cumulative distributions of the reads from the expected strand of 
both AT1G072S0 and AT1G07260 are clearly overlapping in the 
CORT samples, but less overlap is evident in the ENDO samples 
(Fig. 6A). Intriguingly AT1G07260 is expressed significantly lower 
in CORT samples than in ENDO samples, implying that c/s-NATs 
may be regulating gene expression levels of AT1G07260. Overall, 
we identified 152 pairs of known c/s-NATs (Supplemental Table 4) 



dashed lines) and from the reverse strand (Fig. 3A,B, solid lines). The 
prediction of the NASTI-seq algorithm is supported by the cumu- 
lative distribution of reads, with the 3' UTRs of the two genes 
overlapping in a 1.5-kb region (Fig. 3B). The overlapping tran- 
scripts were further validated using gene-specific primers (LI, Rl 
for AT2G46020 and L2, R2 for AT2G46030) (Fig. 3C) . All five pairs 
of novel natural antisense genes were successfully validated (Sup- 
plemental Fig. 5). 

cw-NAT pairs display unique expression patterns 
and cell-type-specific expression 

One advantage of the proposed statistical model is that the true 
gene expression levels of natural antisense pairs can be estimated 
by statistically un-mixing sense and antisense read count data 
(Methods). We hypothesized that this would improve the corre- 
lation of the gene expression levels between biological replicates. 
Using genes that are predicted to form natural antisense pairs, we 
calculated the pairwise Pearson correlation coefficient (PCC) be- 
tween biological replicates. Only c/s-NAT pairs were used to calculate 
PCC in this analysis, since the expression of genes that do not form 
c/s-NAT pairs will not be changed by the statistical unmixing 
method. To evaluate the overall improvement, we calculated R 2 of 
multiple linear regressions using one replicate as response and two 
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Figure 4. Expression quantification of c/s-NAT genes. (A) Scatter plot of 
gene expression levels of c/s-NAT genes before and after adjustment for 
the protocol error rate. (PCC) Pearson correlation coefficient. (6) R 2 of the 
multiple linear regressions between three biological replicates for each 
sample. This R 2 is the multivariate equivalent of the R 2 for simple pairwise 
comparison. 
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Whole Root Sample 




Figure 5. Cumulative frequency of minor gene fraction pip). Numbers 
next to the green and blue curve show the percentage of gene pairs with 
* less than 0.5. 

and 57 pairs of novel ris-NATs (Supplemental Table 5) that display 
opposite expression levels in the two cell types. 

A functional analysis reveals chromatin signatures 
of cw-NAT pairs 

Epigenetic modification of chromatin states has been implicated 
in the regulatory role of ris-NAT pairs in mammals (Conley and 
Jordan 2012; Magistri et al. 2012) and in plants (Luo et al. 2013; 
Zhan and Lukens 2013). To characterize the relationship between 
the chromatin marks and ris-NAT pairs in plants, we examined 
recently published genome-wide epigenetic profiles of two chro- 
matin marks representing activation and repression, H3K4me3 and 
H3K27me3, in Arabidopsis roots (Roudier et al. 201 1). We calculated 
the average enrichment levels for H3K4me3 and H3K27me3 in the 
proximal region of the transcription start sites (± 1000 bp) for all 
major and minor genes in three sets of gene pairs: the predicted ris- 
NAT pairs, the known ris-NAT pairs, and the negative control pairs 
(Supplemental Table 3). Because genes that are not expressed tend to 
lack both H3K4me3 and H3K27me3 chromatin marks (see Discus- 



sion), we analyzed the chromatin marks for gene pairs that are 
expressed (RPKM > 0.01) in all three biological samples (WR, ENDO, 
and CORT). ris-NAT pairs that are inversely expressed between 
ENDO and CORT samples were excluded because these pairs may 
have different chromatin marks in different cell types. In all three 
sets of gene pairs, H3K4me3 marks peak right after the transcription 
start sites, consistent with the known genomic distribution of 
H3K4me3 (Zhang et al. 2009). Both known ris-NAT genes and pre- 
dicted ris-NAT genes show preferential depletion of the activation 
chromatin mark, H3K4me3, just upstream of the transcription start 
site of the major gene (Fig. 7, left). This result is consistent with 
a recent finding that the antisense gene can affect the epigenetic 
marks of the 5 ' end and the promoter region of the sense gene, even 
though the antisense transcript does not extend to the 5' end of the 
sense gene (Modarresi et al. 2012). The cumulative distribution of 
the expression of major genes in all three sets of gene pairs is near 
identical (Supplemental Fig. 7A), suggesting that the observed dif- 
ference in chromatin mark density is not due to differences in gene 
expression levels. In contrast, the H3K4me3 peak of the minor gene 
is higher in the negative control gene pairs than both the predicted 
ris-NAT pairs and known ris-NAT pairs (Fig. 7, right). This can be 
explained as the expression levels of the minor gene tend to be 
higher in the control gene pairs than both the predicted ris-NAT 
pairs and known ris-NAT pairs (Supplemental Fig. 7B). In contrast to 
the activation mark, we did not find differences in the density of 
H3K27me3, a repression mark, in the three sets of gene pairs ana- 
lyzed (Supplemental Fig. 7C,D). Overall, these observations support 
a unique chromatin signature of the activation mark, H3K4me3, in 
the ris-NAT gene pairs in plants. 

Many small RNAs are mapped to cw-NAT pairs 

Natural antisense genes have been shown to generate siRNAs that 
regulate downstream gene expression under stress conditions in 
plants (Borsani et al. 2005). To study the potential of ris-NAT me- 
diated siRNA (nat-siRNA) formation, we analyzed small RNA li- 
braries from the same cell types (Breakfield et al. 2012). We searched 
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Figure 6. Example of cell-type-specific expression of c/s-NAT pairs. (A) Cumulative distribution of reads from the predicted c/s-NAT pair of ATI C07250 
and ATI G07260. The two annotated genes have an intergenic distance of 60 bp. Expression from cortex (CORT) and endodermis (ENDO) samples are 
both shown. The cumulative distributions show a visible overlapping between two genes in the CORT sample but not in the ENDO sample. (8) Gene 
expression levels of the two genes in two cell types. Bar plot shows the RPKM ± SEs of three biological replicates. 
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Figure 7. Distribution of activating chromatin mark. The distribution of 
H3K4me3 around the transcription start sites of major and minor genes in 
predicted c/s-NAT pairs, known c/s-NAT pairs, and negative control pairs. 

for small RNA reads that map to the genomic regions of both newly 
discovered and known natural antisense RNAs (aRNAs). We found 
85 ris-NAT pairs in the whole-root sample, 93 pairs in the CORT 
sample, and 43 pairs in the ENDO sample that generate small RNA 
reads at a rate of at least 10 RPM (Supplemental Table 6). Using the 
same threshold, a comparable number of nat-siRNAs was identified 
in Arabidopsis seedlings under biotic and abiotic stress conditions 
(Zhang et al. 2012), suggesting the fraction of c/s-NATs that induce 
siRNA is relatively similar across conditions. As reported, both 21-bp 
and 24-bp small RNAs are mapped to ris-NAT pairs (Zhang et al. 
2012). However, only a small fraction of these nat-siRNAs are over- 
lapping between this study and the published study under stress 
conditions, suggesting differential regulation of nat-siRNA during 
development and under stress conditions. 

Discussion 

Using new strand-specific RNA-seq data and a statistical model, we 
have identified an additional 918 ris-NAT pairs in the annotated 
Arabidopsis genes. This significantly increased number of c/s-NATs 
is due to the power of the NASTI-seq model, which accounts for 
biological variability, library size variations, and the strand speci- 
ficity of the RNA-seq protocol. The statistical model not only 
performs better than simple count-based or ratio-based methods 
but also outperforms the widely used logistic regression model, 
indicating the importance of tailoring the statistical model to the 
specific protocols. Several types of independent genomic evidence 
support the predicted ris-NAT pairs. For instance, more antisense 
PACs and multi-exon genes were found in both novel c/s-NAT pairs 
and known c/s-NAT pairs than in negative control pairs, indicating 
that genomic features of the novel c/s-NAT pairs are consistent 
with known c/s-NAT pairs. However, these genomic features alone 
are not sufficient to accurately predict c/s-NAT pairs. For example, 
>50% of genes that do not form c/s-NAT pairs also contain anti- 
sense PACs. Because the PACs (Wu et al. 2011) were identified by 
extensively sequencing polyA tags from whole leaf and seeds — 
organs that contain many cell types — many subtle polyA signals are 
expected to be identified at low abundance. Indeed, the average 
expression levels (RPM) for the polyA signals in the non-ris-NAT 
pairs are smaller than the average RPM of the predicted c/s-NAT pairs 
and the known c/s-NAT pairs (Supplemental Fig. 4), suggesting the 
PACs found in non-ris-NAT pairs may represent rare events. Direct 
sequencing of polyadenylated RNA using novel instruments may 
provide an alternative method to identify new polyA signals that 
play a role in the formation of c/s-NATs (Sherstnev et al. 2012). 

The high performance of NASTI-seq is due to the incorpora- 
tion of the strand-specific efficiency, an important feature that is 
not explicitly modeled by any other published methods. The ef- 



ficiency of biochemical assays is a well-known problem in high- 
throughput experiments. For instance, bisulfite sequencing is a 
protocol for analyzing DNA methylation. The C-to-T conversion 
caused by bisulfite treatment can vary depending on the sample 
and protocol (Bock et al. 2010). PAR-CLIP, a protocol for identify- 
ing RNA-binding protein specificity, also shows site dependent 
T-to-C conversion (Hafner et al. 2010). Our statistical framework 
can be easily applied to model the efficiency in any RNA-seq pro- 
tocol to help improve the accuracy of existing methods. 

Approximately 4% of c/s-NAT pairs in each sample can po- 
tentially generate nat-siRNAs with expression levels above 10 RPM. 
At a less stringent cutoff of 1 RPM, 40% of c/s-NAT pairs in each 
sample coincide with small RNAs. Among the 4% nat-siRNAs that 
passed the stringent threshold, the distribution of small RNA reads 
along the c/s-NAT pairs varies from case to case. In some cases, the 
reads mapped exclusively and evenly to one gene in a pair of cis- 
NATs (Supplemental Fig. 8A), consistent with the model that one 
gene induces the degradation of the other gene. In other cases, the 
reads clustered to a small region, usually at the extreme end of one 
gene model (Supplemental Fig. 8B), consistent with a model of 
stalled Pol II polymerase, which generates small RNAs at the 5' end 
of the target gene (Core et al. 2008). There are six c/s-NAT pairs 
where the nat-siRNAs were also predicted as microRNAs (Breakfield 
et al. 2012), suggesting that some microRNA genes are part of 
c/s-NAT pairs. The heterogeneity in the read distribution of the small 
RNAs suggests different underlying mechanisms that generate 
the observed small RNA reads. Further computational and ex- 
perimental analyses are required to establish these yet unknown 
mechanisms and to identify true nat-siRNAs in plants. 

By using NASTI-seq, we identified more than 500 orphan 
antisense transcripts, i.e., those without nearby candidate genes to 
form c/s-NAT pairs. We found at least two types of orphan antisense 
transcript. Some appear to be misannotations of the gene orien- 
tation: >95% of reads map to the unexpected strand along the 
whole-gene model (Supplemental Fig. 3B) and may be due to EST- 
based annotations in which ESTs aligned to the reverse strand. The 
more interesting orphan genes are those that have sense reads or 
antisense reads mapped to different segments of an annotated 
gene model, such as cases in which the sense reads clustered to- 
gether on one side of the annotation, whereas the antisense reads 
clustered on the other side (Supplemental Fig. 3C). These genes are 
consistent with a model of a c/s-NAT pair, except that the antisense 
gene is not yet annotated. In such cases, more evidence from se- 
quence data or from gene predictors of noncoding genes is re- 
quired to further support the c/s-NAT pair. 

Connections between chromatin modifications and c/s-NAT 
pairs have been observed in mammals (Modarresi et al. 2012) and 
in plants (Luo et al. 2013; Zhan and Lukens 2013). Chromatin 
modification data from plant ChlP-chip has previously been 
summarized at the gene level; e.g., a chromatin mark is based on 
signals of all probes in that gene (Roudier et al. 2011). While the 
gene level summary is useful in differentiating genes into groups 
with predominant chromatin states, the interpretation of c/s-NAT 
pairs is confounded by the closeness of genes in each pair and the 
gene expression levels. By carefully using genes with matching 
expression distributions as controls (Supplemental Fig. 7A), we 
found a depleted H3K4me3 distribution around the transcription 
start site of the major genes in c/s-NAT pairs (Fig. 7), suggesting a 
signature for c/s-NAT genes in Arabidopsis roots. Further experi- 
mental analysis using targeted knockdown of one gene in a c/s-NAT 
pair may provide new insights into a mechanistic model of the 
inter-regulation of c/s-NAT pairs and chromatin marks. 
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In conclusion, our methods demonstrated the power of com- 
bining strand-specific RNA-seq, cell-type-specific gene expression, 
and statistical modeling to discover novel ris-NAT pairs. We in- 
creased the known ris-NAT pairs in Arabidopsis by >60% and found 
many ris-NAT pairs for which the expression levels of the two 
genes in a pair switch between cell types, providing an important 
resource for further functional validation of ris-NAT pairs in 
Arabidopsis. Our statistical framework is general; it can be applied 
to any strand-specific protocol to facilitate the identification of 
natural antisense transcription in many biological systems. 

Methods 

Strand-specific RNA-seq protocol 

The A. thaliana Columbia-0 ecotype was used for the WR libraries. 
All seeds were sterilized with 50% bleach and 0.1% Tween for 
7 min and rinsed five times in sterile water. Seeds were then plated 
on sterile mesh on 1 x MS agar with 1% sucrose. Plates were ver- 
nalized 48 h at 4°C and grown under standard conditions for 6 d. 
For the whole-root unamplified library, roots were harvested from 
the plates and frozen on dry ice. Total RNA was extracted using the 
RNeasy Mini Kit (Qiagen) according to the manufacturer's in- 
structions. Poly-adenylated (Poly-A) mRNA was isolated from the 
total RNA sample by Dynabeads Oligo (dT) (Invitrogen). 

Fluorescence activated cell-sorting was used to isolate GFP- 
marked cell populations to enrich for cells in the ENDO and CORT. 
The same protocol was used to mock sort all of the cells (WR) ac- 
cording to the method described previously (Birnbaum et al. 2003). 
Three biological replicates of the two cell-type-specific (ENDO 
and CORT) and the mock sorted samples were isolated. Total 
RNA was extracted using the RNeasy Micro Kit (Qiagen). T7-RNA 
polymerase-mediated amplification was used to generate aRNA and 
preserve strand specificity using the TargetAmp 1-Round aRNA 
Amplification kit (Epicentre/Illumina). To assess the amplification 
protocol, we prepared total RNA from whole roots. Ten micro- 
grams of this RNA was used to prepare one library (unWRl) with 
poly-A selected mRNA using the standard protocol. The same total 
RNA was diluted as input for two libraries that were amplified from 
10 ng (unWR2) and 100 ng (unWR3) of total RNA. All three mock- 
sorted samples and the ENDO and CORT samples were amplified 
using the same protocol. 

All samples were treated with Tabacco Acid Phosphatase 
(Epicentre/Illumina) to remove 5' caps of the WR mRNA and to 
remove the triphosphate from the amplified RNA. Products were 
purified and used as input for the fragmentation reaction using the 
SOLiD Total RNA-Seq Kit (Applied Biosystems). Libraries were 
constructed according to the manufacturer's protocol; fragmen- 
tation was assessed by Bioanalyzer Chip (Agilent). Libraries were 
sequenced on the SOLiD 4 platform. 

Read alignment and quality assessment 

Read alignment was carried out using TopHat 1.4.0 (Trapnell et al. 
2009) with default parameters, except the parameter for the maxi- 
mal intron length is set to 2000 bp (99% of the known Arabidopsis 
introns are shorter than this threshold). From 4%-10% of reads were 
aligned to more than one location (Supplemental Table 2), most of 
which were found to be aligned to two duplicated rRNA loci in 
chromosome 2 and chromosome 3. Because the dual aligned reads 
fall in very specific loci, the effect of multiple aligned read on the 
quantification of other genes is minimal. We choose to use only 
reads that are uniquely aligned to a single location in our analysis. 
For the initial analysis (Results section one, ssRNA-seq of Arabidopsis 
roots), gene expression levels were summarized using Cufflinks 



(Trapnell et al. 2010) with the TAIR10 GTF file. For other analyses 
that involve c/s-NATs, read counts were obtained using R scripts 
with rsamtools package (http://bioconductor.org/packages/release/ 
bioc/html/Rsamtools.html), and RPKM were calculated using cus- 
tomized R script. For genes with multiple isoforms, the expression 
levels were calculated using the length of the longest isoform. 

Statistical model of antisense transcription 

We employ a model comparison framework to identify ris-NAT 
pairs. We denote the number of mapped reads in replicates i, locus 
/', and strand k as N^, where i = 1. . .1, ; = 1. . J, and k = 0 or 1 . We use 
k = 0 to indicate that the reads are mapped to the expected strand, 
whereas k = 1 indicates the reads are mapped to the unexpected 
strand. We denote the N i; - = ^ Njj k as the total number of reads 
mapped to locus ;'. For each locus in the genome, we first calculate 
the probability of the observed data under a sense only model (M 0 ), 
assuming only the sense gene is expressed at this <Mocus. 

p(N ijk \1> 0 ,M 0 )=B(N in \N ih pe i ) X NB(N ¥ |/ty<r,). (1) 

In Equation 1, the total read count is modeled by a negative bi- 
nomial distribution (NB), and the number of reads from the 
unexpected strand is modeled by a binomial distribution (B). pei 
denotes PE and is estimated from all genes in the genome. The 
parameters (jxy , cr i; ) of negative binomial distribution are estimated 
in the same way as in DESeq (Anders and Huber 2010). 

When a pair of antisense genes is transcribed in one locus, all 
the reads that map to the sense strand are generated from two 
sources. Some of the sense reads are from the sense strand gene; 
others are from the antisense strand gene but map to the sense 
strand due to the imperfect protocol efficiency. We introduce N ijks 
to denote the unobserved true read counts, with s as an indicator 
variable for the strand of the underlying gene that generates the 
read counts. If the read counts are from the sense strand gene, 
then s equals zero, otherwise, s equals one. The full calculation of 
the conditional probability of the observed data given the anti- 
sense model requires integrating over all the possible values of the 
missing data. To reduce the computational complexity, we choose 
to approximate the probability of the observed data under an an- 
tisense model using maximum likelihood estimation of the miss- 
ing data (Equation 2; for details of the model, see Supplemental 
Method): 

p(N i/ta |<P 1 ,M 1 ) = HB(N iils \N ihs ,p ei ) X NH(\' iis \^-<r,is)- (2) 



We find parameters that maximize the posterior probability of data 
given each model separately and then calculate the NASTI score 
(similar to a BIC). 

NASTI score= \o%{p{D\4> u Mi)) - log(p(D\<P 0 ,M 0 )) 

-\ x (rfi-rfo) x log(n), (3) 

where d x and d 0 represent the number of free parameters in <3?j and 
<t>o, and n is the sample size. 

The model was first trained on data from all genes to calculate 
the NASTI score for each gene, and then the maximum of both 
scores for each pair of genes was used as a score to classify positive 
training pairs against the negative training pairs (for details, see 
Supplemental Method). Ten-fold cross validation was carried out, 
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and the model was evaluated by an ROC curve (Fig. IB) and 
auROC (Fig. 1C). 



Epigenetic data and small RNA data 

The epigenetic profiles of all Arabidopsis genes were obtained from 
Roudier et al. (2011). Only data from the full genome tiling array 
was used for this analysis. ChlP-chip data were downloaded from 
the GEO database (http://www.ncbi.nlm.nih.gov/geo/, accession 
GSE24710). For each set of genes, the log2 ratio of IP over input 
signals was averaged in each of 100-bp nonoverlapping windows 
upstream of and downstream from annotated transcription start 
sites. Small RNA alignment was carried out according to the method 
previously described (Breakfield et al. 2012). Four biological repli- 
cates of whole-root samples and two biological replicates of each 
of the ENDO and the CORT samples were used. Reads from bi- 
ological replicates were pooled and only reads that aligned with 
no mismatches were used in the analysis. 

Data access 

The raw data files have been deposited at the NCBI Sequence Read 
Archive (SRA; http://www.ncbi.nlm.nih.gov/sra) under accession 
number SRA057956. The predicted ris-NAT pairs can be found in 
Supplemental Table 8. The NASTI-seq software can be found at 
http://www.genome.duke.edu/labs/ohler/research/NASTIseq/. 
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